Abstract

This article proposes a convenient tool for decoding the output of neural
networks trained by Connectionist Temporal Classification (CTC) for
handwritten text recognition. We use regular expressions to describe the
complex structures expected in the writing. The corresponding finite
automata are employed to build a decoder. We analyze theoretically which
calculations are relevant and which can be avoided. A great speed-up
results from an approximation. We conclude that the approximation most
likely fails only if the regular expression does not match the ground truth,
which is not harmful for many applications since the already low probability
will merely be underestimated. The proposed decoder is very efficient
compared to other decoding methods. The variety of applications ranges from
information retrieval to full text recognition. We refer to applications
where we integrated the proposed decoder successfully.


1 Introduction 

Sequence labeling is the task of assigning a (class) label to each position of
an incoming sequence, as in speech or handwriting recognition. These tasks
are typically very complex and even subproblems are challenging. This article
focuses on the decoding problem, i.e. finding the most likely label sequence for
a given output of a classifier such as neural networks (NNs), Hidden Markov
Models (HMMs) or Conditional Random Fields (CRFs).

Deep learning methods have pushed the research on complex tasks such as
handwritten text recognition (see [H]). The special needs of such complex tasks
require advanced decoding methods. For example, a typical subproblem in full
text recognition is structuring the recognizer's output into a sequence of
regions of words, punctuation marks and numbers. In many cases, the most likely
label sequence yields an acceptable segmentation. However, it happens that this
label sequence is not feasible, i.e. it does not match the expected structure
and has to be corrected. Finding the optimal feasible structure is one of many
applications of this article. For this aim, we describe feasible structures by
regular expressions, a powerful pattern notation which is used in nearly all
computational text processing systems such as text editors and programming
languages like Java or Python. We then derive an algorithm based on finite
automata that yields the most likely label sequence fitting the previously
described regular expression.

Beyond finding the optimal feasible label sequence fitting an expected
structure (regular expression), we gain several other features since we also
consider the functionality of capturing groups. A capturing group defines a part
of the regular expression. The associated part of the matching label sequence
can be used to structure the decoding result for further analysis. In case of
our previous example, we obtain a complete segmentation into words, numbers and
symbols without additional parsing, which facilitates the calculation of the
matching subsequence and the likelihood. We just define word, number and symbol
capturing groups. The complete decoding can be done in a few lines of code.

Keyword spotting is another obvious application which can be solved very
conveniently. The keyword is either at the beginning of the line or there is a
space or another separating symbol (quotation marks, opening parenthesis, etc.)
before the keyword. With the common notation of regular expressions, this
pattern may be captured by inserting (.*(?<pre>[ "(-]))? before the keyword
which means: If there is anything before the keyword, it ends with one of the
aforementioned symbols. This last symbol (if there is one) is contained in the
capturing group pre. Information about a group like its probability, its
contained text or its position in the sequence is very important for keyword
spotting and will be provided directly by the derived algorithm. A low
probability of the pre-group, for example, might indicate that a letter is more
likely than our separating symbol such that the spotted character sequence is
only part of a larger word. Analogously, there is an equivalent group after the
keyword.
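
The pattern can be tried out with any regex engine that supports named groups. The following is a minimal sketch in Python, whose `re` module writes named groups as `(?P<name>...)` rather than `(?<name>...)`. The helper name `keyword_pattern`, the group name `post` and the chosen set of separators are assumptions made for illustration only, and the sketch works on plain strings rather than on network outputs.

```python
import re

# Assumed separators: space, quotation mark, opening parenthesis, hyphen.
SEPARATOR = r'[ "(\-]'

def keyword_pattern(keyword: str) -> "re.Pattern[str]":
    """Build a line pattern that spots `keyword` as a separate word."""
    # Optional prefix: anything, ending with one separator (group "pre").
    pre = rf'(.*(?P<pre>{SEPARATOR}))?'
    # Optional suffix: one separator (hypothetical group "post"), then anything.
    post = rf'((?P<post>{SEPARATOR}).*)?'
    return re.compile(pre + re.escape(keyword) + post)

match = keyword_pattern("cat").fullmatch('the "cat sat')
print(repr(match.group("pre")), repr(match.group("post")))
```

With this pattern, a line such as `concatenate` is rejected because no separator precedes the keyword, which mirrors the role of the pre-group described above.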

Regular expressions can be very complex and the calculation of the probability
of all feasible sequences can be very time consuming. We give an approximation
of the most likely label sequence which we motivate theoretically and
experimentally. The approximation is also fundamental to the proposed decoder
since a conventional A*-search suffers from a combinatorial explosion of all
feasible sequences and leads to inefficient decoding times. The decoder is
developed for neural networks trained by Connectionist Temporal Classification
(CTC). Thus, CTC-trained systems are assumed throughout the paper. Some of the
currently most successful handwriting recognition systems were trained with CTC
as shown in several competitions. To give just one example, the probably most
challenging real world task is the Maurdor project which was won by A2IA in 2014
using CTC (see [H]). CTC is not limited to text recognition. Recently the
performance of several speech recognition systems trained with CTC equaled that of
other state of the art methods (e.g. mm)- 

The proposed algorithm is an essential part of the award winning systems 
|18| and m which were also trained with CTC. Recently, the system reaffirmed 
the capability by winning the HTRtS15 competition mi- 

The performant connection between regular expressions and machine learning
algorithms has been investigated in previous articles. In the context of
speech recognition, m showed in detail how to incorporate static prior
knowledge like n-grams or phoneme models into finite state transducers. Although
the authors exploit similar models to do the decoding, the purpose differs from
ours since they model more static connections between ton, speech and language
while we aim at a flexible, adaptive decoding algorithm. Earlier, jj provided a
comprehensive analysis of links between probabilistic automata (i.e. automata
with a probabilistic transition) and HMMs from a theoretical point of view,
finally concluding, among other things, that there is a correspondence between
both models. This basically means that HMMs can be seen as the probabilistic
version of finite automata.

Some links between regular expressions, their corresponding automata and 
HMMs are given in m- The authors showed how to create HMMs from regular 
expressions to detect biological sequences. A similar but generalized approach is 
given in |3] . There the authors construct a simplified HMM model for a general 
text line in the context of word spotting. These text line models basically consist 
of the keyword surrounded by space and filler models. They also proposed an 
enhanced model where only the prefix or suffix of the keyword is given. This 
model allows a set of feasible words containing the defined prefix or suffix. 

Recently, Bideault et al. published a similar approach to ours in P. They 
proposed an HMM-BDLSTM hybrid model for word spotting exploiting regular
expressions. Their model uses the posterior probability of the network as
emission probability of the HMM (which means using $P(x|y)$ as estimator for
$p(y|x)$, where $x$ is the hidden variable and $y$ is the observation). Analogously
to [9], they build small HMM models in advance (e.g. for a keyword, for digits 
or letters) and combine them to a model capturing the regular expression. The 
authors then applied their model to keyword and “regex” spotting. 

In contrast to the above articles, we do not make use of an HMM model. 
Yet in P, the HMMs work only as convenient graphical model for decoding 
rather than as classifier. Instead of using a generative model to find the most
likely sequence, our algorithm is based on the original graphical structure of
the regular expressions: the finite state automaton. If the automaton accepts a
label sequence, it is feasible. Hence, we are able to search in the output of a
neural network for any regular expression without any previously created or
trained generative model. That means the input simply consists of a regular
expression and the network's output matrix, and the output is the most likely
sequence, its probability or the capturing groups defined by the regular
expression.

The remainder of this article is organized as follows: We first give a formal
definition of decoding (Section 2). In Section 3 we give a brief introduction
to regular expressions and automata. Furthermore, we modify the automaton
slightly to adapt it to the NN-decoding requirements. We introduce the
RegEx-Decoder in Section 4. We finish with some experiments (Section 5) and
a conclusion. The appendix provides the proofs of our theorems for theoretically
interested readers.


2 Training and decoding 

This section introduces the CTC training scheme for neural networks and some 
basic aspects of their decoding. We mainly follow the notation of [3]. 

Let $\Sigma$ be the alphabet and $\Sigma' = \Sigma \cup \{\star\}$ where $\star$
is an artificial garbage label (also called blank) indicating that none of the
labels from $\Sigma$ is present. We call the garbage label not a character (NaC)
in the following. An element of $\Sigma$ is called character and appears in the
ground truth. Sequences from $\Sigma^* := \bigcup_{i \geq 0} \Sigma^i$ are called
words. Elements of $\Sigma'$ are labels and represent the different classes of
the NN. Sequences from $(\Sigma')^*$ are called paths. The most likely path is
called best path. Assume a neural network which maps an input $X$^1 to a matrix
$Y \in [0,1]^{T \times |\Sigma'|}$ of probabilities per position and label, i.e.
$y_{t,l}$ denotes the probability for the $l$th label at position $t$. Note that
we assume $\forall t : \sum_{l} y_{t,l} = 1$ and $\forall t, l : y_{t,l} > 0$
throughout the paper.

^1 In contrast to $Y$, both dimensions of $X$ may vary.

To map a path $\pi$ to a word $z$, one merges consecutive identical $\pi_t$ and
deletes the NaCs. Let $\mathcal{F} : (\Sigma')^* \to \Sigma^*$ define the related
function which maps a path to a word. More precisely, $\mathcal{F}(\pi) =
\mathcal{D}(\mathcal{S}(\pi))$ is the composition of two functions $\mathcal{D}$
and $\mathcal{S}$, where $\mathcal{S}$ merges all consecutive identical labels
into one and $\mathcal{D}$ deletes all remaining NaCs.
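
The mapping from paths to words can be illustrated with a short Python sketch. The function names and the choice of `"*"` as the NaC symbol are assumptions made for this example only.

```python
from itertools import groupby

NAC = "*"  # assumed symbol for the not-a-character (blank) label

def merge_repeats(path):
    """S: merge runs of consecutive identical labels into a single label."""
    return [label for label, _ in groupby(path)]

def delete_nacs(labels):
    """D: delete all NaC labels."""
    return [label for label in labels if label != NAC]

def collapse(path):
    """F = D o S: map a path (sequence of labels) to a word."""
    return "".join(delete_nacs(merge_repeats(path)))

print(collapse("*cc*aa*t*"))  # a path that maps to the word "cat"
```

Note that the order of the two steps matters: `collapse("a*a")` keeps both characters because the NaC separates them, whereas `collapse("aa")` yields a single character.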

We assume that the likelihoods $y_{t,c}$ are conditionally independent for
distinct $t$ given $X$. Thus, the likelihood of any path $\pi$ is given as

$$P(\pi|X) = \prod_{t=1}^{T} y_{t,\pi_t}. \tag{1}$$

The probability of any word z is then the sum of the probabilities of all paths 
mapping to z: 


$$P(z|X) = \sum_{\pi \in \mathcal{F}^{-1}(z)} P(\pi|X).$$


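Let $\tilde z \in (\Sigma')^*$ be the extension of the word $z \in \Sigma^*$,
that means we add a NaC before $z$, after $z$ and between each pair of
characters. Thus, $|\tilde z| = 2|z| + 1$. Then one can calculate $P(z|X)$ in an
iterative manner: The forward variable $\alpha_i(t)$ denotes the probability of
the prefix $\tilde z_1, \dots, \tilde z_i$ of $\tilde z$ at time $t$ given $X$
and, hence, $\alpha_1(t)$ denotes the probability of the empty word prefix. Thus,

$$\alpha_1(t) = \prod_{t'=1}^{t} y_{t',\tilde z_1} = \prod_{t'=1}^{t} y_{t',\star}.$$

For $t = 1$, the other initial $\alpha_i(1)$ are

$$\alpha_2(1) = y_{1,\tilde z_2} = y_{1,z_1}, \qquad \alpha_i(1) = 0 \quad \forall i > 2.$$

Then, the probability of any prefix at time $t$ is

$$\alpha_i(t) = y_{t,\tilde z_i} \sum_{j \in \phi(i)} \alpha_j(t-1) \tag{2}$$

where

$$\phi(k) = \begin{cases} \{k-1, k\} & \text{if } \tilde z_k = \tilde z_{k-2} \text{ or } k = 2, \\ \{k-2, k-1, k\} & \text{else.} \end{cases}$$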


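The probability $P(z|X)$ is then equal to the sum $\alpha_{|\tilde z|}(T) +
\alpha_{|\tilde z|-1}(T)$ of the two last forward variables at time $T$.
Analogously, one can start at $T$ and calculate the suffix probabilities:

$$\beta_{|\tilde z|}(T) = 1, \qquad \beta_{|\tilde z|}(t) = \prod_{t'=t+1}^{T} y_{t',\star}, \qquad \beta_i(T) = 0 \quad \forall i < |\tilde z| - 1,$$

$$\beta_i(t) = \sum_{k \in \psi(i)} y_{t+1,\tilde z_k}\, \beta_k(t+1)$$

where

$$\psi(k) = \begin{cases} \{k+1, k\} & \text{if } \tilde z_k = \tilde z_{k+2} \text{ or } k = |\tilde z| - 1, \\ \{k+2, k+1, k\} & \text{else.} \end{cases}$$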
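
The forward recursion (2) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the names `extend` and `word_probability`, the NaC symbol `"*"` and the representation of $Y$ as a list of dictionaries (one per position, mapping each label to its probability) are assumptions.

```python
NAC = "*"  # assumed symbol for the NaC label

def extend(word):
    """Insert a NaC before, after and between the characters of `word`."""
    extended = [NAC]
    for ch in word:
        extended += [ch, NAC]
    return extended

def word_probability(Y, word):
    """Sum of the probabilities of all paths that map to `word`."""
    z = extend(word)
    n = len(z)
    # Initial values for t = 1: only the first NaC or the first character.
    alpha = [0.0] * n
    alpha[0] = Y[0][z[0]]
    if n > 1:
        alpha[1] = Y[0][z[1]]
    for t in range(1, len(Y)):
        new = [0.0] * n
        for i in range(n):
            total = alpha[i]                    # stay on the same label
            if i >= 1:
                total += alpha[i - 1]           # move one label forward
            if i >= 2 and z[i] != z[i - 2]:
                total += alpha[i - 2]           # skip the NaC in between
            new[i] = Y[t][z[i]] * total
        alpha = new
    # The word ends either on its last character or on the trailing NaC.
    return alpha[-1] + (alpha[-2] if n > 1 else 0.0)
```

The skip over a NaC is only allowed between different characters, which corresponds to the case distinction in the index set of the recursion.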


2.1 Connectionist Temporal Classification 

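To optimize the log likelihood objective function

$$O(z,X) = -\ln P(z|X) \to \min,$$

Connectionist Temporal Classification uses gradient descent. Hence, we need to
provide the gradient

$$\frac{\partial O(z,X)}{\partial y_{t,l}} = -\frac{1}{P(z|X)} \sum_{\substack{\pi \in \mathcal{F}^{-1}(z) \\ \pi_t = l}} \ \prod_{\substack{t'=1 \\ t' \neq t}}^{T} y_{t',\pi_{t'}}$$

for any $t \in \{1, \dots, T\}$ and $l \in \Sigma'$. With the above defined
$\alpha$ and $\beta$,

$$\sum_{\substack{\pi \in \mathcal{F}^{-1}(z) \\ \pi_t = l}} \ \prod_{\substack{t'=1 \\ t' \neq t}}^{T} y_{t',\pi_{t'}} = \sum_{\substack{i=1 \\ \tilde z_i = l}}^{|\tilde z|} \frac{\alpha_i(t)\, \beta_i(t)}{y_{t,l}}.$$

Starting with $\partial O(z,X) / \partial y_{t,l}$, the standard backpropagation
algorithm propagates the error into the network and optimizes its parameters. A
more detailed description can be found in [5].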


2.2 Decoding 

During the prediction phase, we are interested in the $z \in \Sigma^*$ which
maximizes $P(z|X)$. Usually, there are conditions which allow only certain
$z \in \Sigma^*$. A common example is the condition that $z$ must be an element
of a certain vocabulary $V$. If the allowed words are restricted to a finite
vocabulary of reasonable size, one can find the most likely vocabulary item by
calculating $P(z|X)$ for each $z \in V$ individually using the forward
probabilities $\alpha$ as introduced above. We call this decoding procedure
string-by-string decoding since we calculate the word probabilities one after
the other. Throughout this article, we approximate the word likelihood by the
probability of its most likely path, i.e. we replace the sum by the maximum in
eq. (2). The most probable path yields an alignment of positions and class
labels, it speeds up the calculation and, since there is typically one dominant
path, it is a reasonable approximation to $P(z|X)$. Thus,

$$z^* = \arg\max_{z \in V} \max_{\pi \in \mathcal{F}^{-1}(z)} P(\pi|X).$$
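
String-by-string decoding with the best-path approximation can be sketched as follows. The code is an illustration under assumptions: $Y$ is represented as a list of dictionaries (one per position, mapping each label to its probability), `"*"` stands for the NaC, and the function names are hypothetical.

```python
NAC = "*"  # assumed symbol for the NaC label

def extend(word):
    """Insert a NaC before, after and between the characters of `word`."""
    extended = [NAC]
    for ch in word:
        extended += [ch, NAC]
    return extended

def best_path_probability(Y, word):
    """Probability of the most likely path that maps to `word`."""
    z = extend(word)
    v = [0.0] * len(z)
    v[0] = Y[0][z[0]]
    if len(z) > 1:
        v[1] = Y[0][z[1]]
    for t in range(1, len(Y)):
        new = [0.0] * len(z)
        for i in range(len(z)):
            best = v[i]                          # stay on the same label
            if i >= 1:
                best = max(best, v[i - 1])       # move one label forward
            if i >= 2 and z[i] != z[i - 2]:
                best = max(best, v[i - 2])       # skip the NaC in between
            new[i] = Y[t][z[i]] * best
        v = new
    return max(v[-2:])

def decode_vocabulary(Y, vocabulary):
    """Evaluate the vocabulary words one after the other and keep the best."""
    return max(vocabulary, key=lambda w: best_path_probability(Y, w))
```

The cost grows linearly with the number of vocabulary words, since one table with $2|z|+1$ columns is filled per word.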

3 Regular expressions and finite automata 

Finding the most likely label sequence following a special structure requires a
tool for describing this structure. We model such structures as regular
languages, which have been developed to describe complex structures of this kind
(see [S]). There is a correspondence between regular languages / regular
expressions and finite-state automata, a model of computation for that language.
We use both: the regular expression to describe the set of expected sequences
and the automaton to exploit the transition graph during the decoding process.
This section gives a brief introduction to regular expressions and finite state
automata. Readers who are already familiar with both may proceed with
Subsection 3.1.

Definition 1 (regular expression / regular language). The empty word
$\varepsilon$, the empty set $\emptyset$ and $a \in \Sigma$ are regular
expressions denoting the regular languages $\{\varepsilon\}$, $\emptyset$ and
$\{a\}$, respectively. If $\mathcal{L}(r_1)$ and $\mathcal{L}(r_2)$ are two
regular languages defined by the regular expressions $r_1$ and $r_2$, then also
$\mathcal{L}(r_1) \cup \mathcal{L}(r_2) = \mathcal{L}(r_1|r_2)$ (alternation,
i.e. $r_1$ or $r_2$), $\mathcal{L}(r_1)\mathcal{L}(r_2) = \mathcal{L}(r_1 r_2)$
(concatenation of $r_1$ and $r_2$) and $(\mathcal{L}(r_1))^* =
\mathcal{L}(r_1^*)$ (Kleene closure, i.e. the set of all finite sequences of
words from $\mathcal{L}(r_1)$) are regular languages. There are no other regular
languages than the above.

Thus, regular expressions define languages containing specific sequences of 
literals from E. Those expressions can be represented in a model of computation. 
This model is known as 

Definition 2 (Automaton). The nondeterministic finite automaton (NFA) $N$
is a 5-tuple $(Q, \Sigma \cup \{\varepsilon\}, \delta, q_0, F)$, where $Q$ is
the finite set of states, $\Sigma$ is the alphabet, $\varepsilon$ is the empty
word, $\delta : Q \times (\Sigma \cup \{\varepsilon\}) \to \mathcal{P}(Q)$ is
the state transition function, $q_0 \in Q$ is the initial state and
$F \subseteq Q$ is the set of final states.

We call $N$ a deterministic finite automaton (DFA) iff $\forall q \in Q :
\delta(q, \varepsilon) = \emptyset$ and $\forall q \in Q, a \in \Sigma :
|\delta(q, a)| \leq 1$.

For any regular expression there is an NFA accepting the corresponding 
language and the other way around. There may be more than one automaton 
accepting a regular language. Analogously, there may be more than one regular 
expression describing the same language. For any specific regular expression, we 
will create a corresponding NFA using Thompson’s Construction Algorithm (for 
details see m), according to which any regular expression can be represented by
some combination of the elementary NFAs depicted in Figure 1. An equivalent^2
DFA is obtained by the Subset Construction Algorithm. 

Generally, the subset construction algorithm generates a DFA with $2^n$ states
if $n$ is the number of NFA states. m showed that there are languages which
also require exactly $2^n$ DFA states, i.e. the NFA is exponentially more
succinct than the DFA. Instead of using DFAs, we substitute the states of the
NFA by their epsilon closure and use the resulting NFA. That means, we delete
each $\varepsilon$-transition and replace it by the next
non-$\varepsilon$-transition. The resulting automaton will accept the same
language as the original one.

^2 Two finite automata are equivalent if they accept the same language.

Figure 1: Schematic representation of atomic NFAs resulting from Thompson's
Algorithm: (a) A|B, (b) A*, (c) AB. A and B are regular expressions and $N_A$
and $N_B$ are the related NFAs. Other quantifiers or operators can be expressed
by those three.

Figure 2: Illustrations for extended NFAs: (a) transition, (b) substitution,
(c) leaf substitution. Double circles represent final states.


3.1 Adaptation to $\mathcal{F}$

The function $\mathcal{F}$ (see Section 2) maps a label sequence to a word by
merging consecutive identical labels ($\mathcal{S}$) and deleting NaCs
($\mathcal{D}$). To allow optional NaCs between different characters during the
decoding, we extended the word $z$ to $\tilde z$. Analogously, we extend the
transitions between the NFA states in the following way: Figure 2(a) shows the
transition which is substituted by Figure 2(b). If $q_1$ is final, the newly
inserted state is final as well. $q_1$ and $q_2$ could even be the same state.
Final leaf states (i.e. states without outgoing edges) are connected to another
final state by reading a NaC as shown in Figure 2(c). Algorithm 1 provides the
pseudo code for extending the automaton. It accepts the language
$\tilde{\mathcal{L}}(r) := \{\tilde w \in \mathcal{L}(\star? w_1 \star? w_2
\star? \dots \star? w_{|w|} \star?) \mid w \in \mathcal{L}(r)\}$ of words
interrupted by optional^3 NaCs. This is the adaptation to the
$\mathcal{D}$-part of $\mathcal{F}$.

Instead of adapting $N$ also to $\mathcal{S}$, we leave this step to the
algorithm in Section 4 to simplify the notation. Since $\mathcal{S}$ merges
identical consecutive labels, the continuation of a read label is left to the
decoder (see the "cont" function in later sections). We call the automaton
adapted to $\mathcal{D}$ extended automaton and symbolize it by $\tilde N$.


Example 3. We construct an automaton accepting the language $\mathcal{L} =
\{cat, bat\}$. The naive alternation cat|bat of both words leads to an automaton
with 14 states using Thompson's Construction and the above described extension.
We could save 4 states and transitions by alternating only the first letters.
The regular expression (c|b)at will generate the following automaton:

[Figure: extended automaton for (c|b)at]

If we aggregate the labels c and b like [bc]at, we could save two additional
states and even 5 transitions. Thus, instead of using multiple arcs for
connecting the same states but reading different labels, we aggregate them into
one transition:

[Figure: extended automaton with aggregated arcs reading c|b, a and t]

Thus, there is at most one transition between any two states, which possibly
reads multiple labels. Obviously, any accepted label sequence produces an
emission sequence collapsing to "cat" or "bat". Note that we need just 10
transitions where the decoding process from Section 2.2 needs to calculate 7
table columns for each word. If we add the words fat, rat, hat to our list of
accepted words, the conventional decoding of Section 2.2 calculates 3.5 times
more table columns than there are transitions in the automaton.

^3 We use the regular expression notation (?) to mark symbols as optional.


Algorithm 1: extendAutomaton

input : NFA $(Q, \Sigma, \delta, q_0, F)$
output: Extended NFA $(Q', \Sigma', \delta', q_0, F')$

$Q' \leftarrow Q$;
$F' \leftarrow F$;
for $q \in Q$ do
    create new state $q'$;
    $Q' \leftarrow Q' \cup \{q'\}$;
    $\delta'(q, \star) \leftarrow \{q'\}$;
    for $a \in \Sigma$ do
        $\delta'(q', a) \leftarrow \delta(q, a)$;
    if $q \in F$ then
        $F' \leftarrow F' \cup \{q'\}$;
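
The extension of the automaton could be sketched in Python as follows. The sketch reflects one reading of Algorithm 1 and rests on several assumptions: the automaton is free of $\varepsilon$-transitions, the transition function is a dictionary mapping `(state, label)` to a set of states, the original transitions are kept, `"*"` stands for the NaC, and the new state of `q` is named `(q, "nac")`.

```python
NAC = "*"  # assumed symbol for the NaC label

def extend_automaton(states, alphabet, delta, finals):
    """Add an optional NaC state behind every state of an epsilon-free NFA."""
    new_states = set(states)
    new_finals = set(finals)
    # Assumption: all original transitions are kept unchanged.
    new_delta = {key: set(targets) for key, targets in delta.items()}
    for q in states:
        q_nac = (q, "nac")                    # hypothetical name of the new state
        new_states.add(q_nac)
        new_delta[(q, NAC)] = {q_nac}         # reading a NaC leads to the new state
        for a in alphabet:
            if (q, a) in delta:
                # The new state inherits the outgoing transitions of q.
                new_delta[(q_nac, a)] = set(delta[(q, a)])
        if q in finals:
            new_finals.add(q_nac)
    return new_states, new_delta, new_finals

# Automaton for the word "ab": 0 -a-> 1 -b-> 2, state 2 is final.
states, delta, finals = extend_automaton(
    {0, 1, 2}, "ab", {(0, "a"): {1}, (1, "b"): {2}}, {2})
print(len(states), len(delta))
```

Repeated labels such as `aab` or `a**b` are deliberately not accepted by this automaton, since the merging step is left to the decoder.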


4 Efficient decoding of regular expressions 

Given a regular expression $r$ and the corresponding extended NFA
$\tilde N = (Q, \Sigma', \delta, q_0, F)$, we search for the most likely word
$z^*$ in $\mathcal{L}(r)$:

$$z^* = \arg\max_{z \in \mathcal{L}(r)} \max_{\pi \in \mathcal{F}^{-1}(z)} P(\pi|X).$$

In contrast to calculating the likelihood of every single feasible word from
$\mathcal{L}(r)$, we exploit the graphical structure of $\tilde N$ to find
$z^*$. This can be done very efficiently if $\tilde N$ is succinct (as e.g. in
Example 3).

4.1 A* and beam search 

In this subsection, we review two standard decoding approaches: the A*-search
and the beam search.

Algorithm 2 describes a naive A*-search algorithm on regular expressions
that returns the most likely path. This algorithm yields the best result but
it can be time consuming because of the huge number of possible paths. To
cut unlikely paths, we define an upper bound $\hat P(\pi, t|X)$ for the final
probability $P(\pi_T|X)$ of any final path $\pi_T$ starting with the prefix
$\pi$ (position 1 to $t$). In our experiments, we filled up $\pi$ with the
best-path suffix $\beta_{t+1:T}$ (i.e. $\hat\pi := \pi\beta_{t+1:T}$) such that
$\hat P(\pi, t|X) := \prod_{s=1}^{t} y_{s,\pi_s} \prod_{s=t+1}^{T} y_{s,\beta_s}$.
Another heuristic which appears to work

well in practice is to sort the prefix list L by . This sorting yields a quick 

first best guess such that unlikely paths can be deleted soon. 


Algorithm 2: A*-search

input : Network output $Y$, extended NFA $\tilde N = (Q, \Sigma', \delta, q_0, F)$
output: most likely feasible path $\pi^*$

for $\gamma \in \Sigma'$ do
    for $q' \in \delta(q_0, \gamma)$ do
        Add $(q', \gamma, 1)$ to $L$;   /* initialize L */
while $L$ not empty do
    $(q, \pi, t) \leftarrow$ item from $L$ with maximum value of the sorting criterion;
    Remove $(q, \pi, t)$ from $L$;
    if $t < T$ then
        for $\gamma \in \Sigma' \setminus \{\pi_t\}$ do
            for $q' \in \delta(q, \gamma)$ do
                Add $(q', \pi\gamma, t+1)$ to $L$;
        Add $(q, \pi\pi_t, t+1)$ to $L$;   /* cover the S part of F */
    else if $q \in F$ then
        $\pi^* \leftarrow \pi$;
        Remove all $(q', \pi', t') \in L$ with $\hat P(\pi', t'|X) < P(\pi|X)$;

Since the number of feasible paths grows exponentially in the worst case,
there is a standard heuristic to reduce the search space called beam search. For
example in [7], the authors introduced a beam search algorithm for efficient
decoding in speech recognition which allows only $n$^4 prefixes at any
position. Algorithm 3 contains its pseudo code adapted to our problem.
Generally, beam search does not guarantee to find the optimal sequence. The
given algorithm has the additional drawback that it does not even guarantee to
find any feasible path at all since the final list $L$ could contain only
$(q, \pi, T)$ with $q \notin F$.

^4 $n$ is called the beam width.


Algorithm 3: beam search

input : Network output $Y$, extended NFA $\tilde N = (Q, \Sigma', \delta, q_0, F)$
output: most likely feasible path $\pi^*$

for $\gamma \in \Sigma'$ do
    for $q' \in \delta(q_0, \gamma)$ do
        Add $(q', \gamma, 1)$ to $L$;   /* initialize L */
for $i \leftarrow 2$ to $T$ do
    $L' \leftarrow$ the $n$ most likely items of $L$;
    $L \leftarrow \{\}$;
    for $(q, \pi, t) \in L'$ do
        for $\gamma \in \Sigma' \setminus \{\pi_t\}$ do
            for $q' \in Q : q' \in \delta(q, \gamma)$ do
                Add $(q', \pi\gamma, t+1)$ to $L$;
        Add $(q, \pi\pi_t, t+1)$ to $L$;   /* cover the S part of F */
$\pi^* \leftarrow \pi$ from $(q, \pi, t) \in L$ with maximum $P(\pi|X)$ and $q \in F$;
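
A compact Python sketch of the beam search on an extended automaton is given below. It is meant as an illustration of the pseudo code, not as a reference implementation: the transition function is assumed to be a dictionary mapping `(state, label)` to a set of states, $Y$ is assumed to be a list of dictionaries (one per position), and the function returns `None` if no feasible path survives.

```python
def beam_search(Y, delta, q0, finals, labels, width):
    """Return (path, probability) of the best feasible path found, or None."""
    # Initialize the list with all labels that can be read from the start state.
    beam = [(q, (a,), Y[0][a])
            for a in labels for q in delta.get((q0, a), ())]
    for t in range(1, len(Y)):
        # Keep only the `width` most likely prefixes.
        beam = sorted(beam, key=lambda item: item[2], reverse=True)[:width]
        candidates = []
        for q, path, prob in beam:
            last = path[-1]
            for a in labels:
                if a == last:
                    continue  # a new arc must not repeat the previous label
                for q_next in delta.get((q, a), ()):
                    candidates.append((q_next, path + (a,), prob * Y[t][a]))
            # Continue reading the previous label (merged later by S).
            candidates.append((q, path + (last,), prob * Y[t][last]))
        beam = candidates
    feasible = [item for item in beam if item[0] in finals]
    if not feasible:
        return None
    _, path, prob = max(feasible, key=lambda item: item[2])
    return path, prob
```

A small beam width keeps the number of prefixes per position constant, at the price that the most likely feasible path may be pruned before it reaches a final state.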


4.2 RegEx-Decoder 

In this subsection, we introduce another decoding algorithm which exploits the
structure of the given automaton and thus is more efficient than the A*-search,
while it still guarantees, under mild conditions, to return the most likely
path. In contrast to the token passing algorithm from [10], one transition
may read several input labels. We finally show that considering the three most
likely labels per arc and position is sufficient. This feature allows us to
preprocess the network output $Y$ such that each arc only processes the most
likely of the labels it reads, which avoids unnecessary calculations.
Additionally, we keep fewer paths compared to the token passing algorithm.

Let $\Pi(t, q', q)$ be the set of prefixes of $\mathcal{F}^{-1}(\mathcal{L}(r))$
of length $t$ on condition that the automaton moves from state $q'$ to $q$ at
position $t$. Instead of keeping all possible prefixes, we only keep one prefix
per arc, label of that arc and time point: The probability of the most likely
prefix from $\Pi(t, q', q)$ is denoted by $\alpha^1_{t,q',q}$. Multiply labeled
arcs have different superscripts $i$ of $\alpha^i_{t,q',q}$, each of them
corresponding to a different label of $(q', q)$. The $\alpha^i_{t,q',q}$ can be
calculated iteratively by

$$\alpha^i_{t,q',q} = \max_{\substack{\pi \in \Pi(t,q',q) \\ \pi_t \notin \{\zeta^j_{t,q',q} \mid j < i\}}} P(\pi|X)$$

where $\zeta^i_{t,q',q}$ denotes the ending label $\pi_t$ of the specific
$t$-prefix $\pi \in \Pi(t, q', q)$ which has a likelihood of
$\alpha^i_{t,q',q}$ (i.e. $\zeta^i_{t,q',q} \neq \zeta^j_{t,q',q}$ for
$i \neq j$).^5 If we maximize over an empty set, we assume the result is zero.
Let $\alpha$ be a variable containing all $\alpha^i_{t,q',q}$ ($\zeta$
analogously). Let $\mathcal{P}(q) = \{q' \in Q \mid \exists a \in \Sigma' :
q \in \delta(q', a)\}$ be the set of predecessor states of $q$.

^5 Let $\alpha^1_{t,q',q}$ be the probability of the most likely prefix
$\pi^1$, then $\pi^1_t = \zeta^1_{t,q',q}$.

Remark 4. Let $\tilde N$ be the extended automaton with respect to a regular
expression $r$ and $\alpha$ as defined above. The probability of the most likely
path $\pi^*(r)$ with $\mathcal{F}(\pi^*(r)) \in \mathcal{L}(r)$ is given by

$$P(\pi^*(r)|X) = \max_{z \in \mathcal{L}(r)} \max_{\pi \in \mathcal{F}^{-1}(z)} P(\pi|X) = \max_{q \in F,\, q' \in \mathcal{P}(q)} \max_{\pi \in \Pi(T,q',q)} P(\pi|X) = \max_{q \in F,\, q' \in \mathcal{P}(q)} \alpha^1_{T,q',q}.$$

Thus, we only need $\alpha^1_T$ to calculate the likelihood of $r$ with respect
to $Y$. Unfortunately, we also need the preceding $\alpha^i_{t-1}$ for $t > 1$
to calculate $\alpha^1_t$.

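Let

$$\gamma^i_{t,q',q} := \arg\max_{\substack{a \,:\, q \in \delta(q',a) \\ a \notin \{\gamma^j_{t,q',q} \mid j < i\}}} y_{t,a}$$

be the $i$th likely label per arc $(q', q)$ and position $t$. This especially
means $y_{t,\gamma^1_{t,q',q}} \geq y_{t,\gamma^2_{t,q',q}} \geq \dots$.
Obviously, the initial values of $\alpha$ are $\alpha^i_{1,q',q} =
y_{1,\gamma^i_{1,q',q}}$ if $q' = q_0$ and $\alpha^i_{1,q',q} = 0$ else.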


Remark 5. Note that every non-NaC arc represents a character or a group
of equivalent characters of the regular expression. Thus, two consecutive arcs
must not read the same label $a \in \Sigma$ at consecutive positions since this
would mean moving two characters forward in the accepted word. But a sequence of
identical consecutive labels is mapped to one character by $\mathcal{F}$, which
allows only one step forward. Thus, if the most likely previous arc
$(q'', q')$ ends on $\zeta^1_{t-1,q'',q'} = \gamma^1_{t,q',q}$, we calculate the
$t$-prefix probability by either combining the most likely $(t-1)$-prefix not
reading $\gamma^1_{t,q',q}$ with $\gamma^1_{t,q',q}$, or we keep the most likely
$(t-1)$-prefix and extend it by the second most likely label
$\gamma^2_{t,q',q}$. In the first case, we also have to calculate
$\alpha^2_{t-1,q'',q'}$ for all arcs.


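There are two possible types of contributions to calculate $\alpha^i_{t,q',q}$.
We either come from a previous arc (i.e. append a new label) or we continue
reading the label of the previous prefix through $(q', q)$ (i.e., stay on the
arc and cover the $\mathcal{S}$ part of $\mathcal{F}$). For the most likely
$t$-prefix the likelihood $\alpha^1_{t,q',q}$ is obviously calculated by

$$\alpha^1_{t,q',q} = \max\{\mathrm{app}(t, q', q, 1), \mathrm{cont}(t, q', q, 1)\}$$

where

$$\mathrm{app}(t, q', q, 1) = \max_{q'' \in \mathcal{P}(q')} \max_{k,a} \left\{ \alpha^k_{t-1,q'',q'}\, y_{t,a} \mid a \in \Sigma' \setminus \{\zeta^k_{t-1,q'',q'}\} : q \in \delta(q', a) \right\}$$

$$\mathrm{cont}(t, q', q, 1) = \max_{k} \left\{ \alpha^k_{t-1,q',q}\, y_{t,\zeta^k_{t-1,q',q}} \right\}.$$

A straightforward generalization with the additional restriction not to read
$\zeta^j_{t,q',q}$ ($j < i$) leads to the general calculation schema of

$$\alpha^i_{t,q',q} = \max\{\mathrm{app}(t, q', q, i), \mathrm{cont}(t, q', q, i)\} \tag{3}$$

where

$$\mathrm{app}(t, q', q, i) = \max_{q'' \in \mathcal{P}(q')} \max_{k,a} \left\{ \alpha^k_{t-1,q'',q'}\, y_{t,a} \mid a \in \Sigma' \setminus \left( \{\zeta^k_{t-1,q'',q'}\} \cup \{\zeta^j_{t,q',q} \mid j < i\} \right) : q \in \delta(q', a) \right\}$$

$$\mathrm{cont}(t, q', q, i) = \max_{k} \left\{ \alpha^k_{t-1,q',q}\, y_{t,\zeta^k_{t-1,q',q}} \mid \forall j < i : \zeta^k_{t-1,q',q} \neq \zeta^j_{t,q',q} \right\}.$$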

Starting from $q_0$, we now calculate $\alpha^i_{t,q',q}$ for each $i$, arc
$(q', q)$ and time point $t$. The maximum $\alpha^1_{T,q',q}$ for $q \in F$ will
be the maximum probability of all feasible paths. We even reduced the search
space by keeping only one prefix probability per arc, allowed label of the
specific arc and time point. This means, we have a polynomial time complexity
(instead of an exponential time complexity as the A*-search). More precisely,
the calculation of $\alpha$ requires
$O\left(T |\Sigma'| \sum_{q \in Q} \sum_{q' \in \mathcal{P}(q)} |\mathcal{P}(q')|\right)$
multiplications in the worst case. Although the running time seems to be cubic
in the number of states $|Q|$, in practical applications, the number of
predecessors of each state is typically limited by a constant. Thus, the
expected running time is rather linear in $|Q|$.

The most likely path can be found via simple backtracking.

Speed-up 

In the following, we analyze the most likely paths of $\alpha$ and speed up the
process by avoiding unnecessary calculations. The speed-up is based on two
theorems which finally lead to a time complexity which is independent of the
number of labels in $\Sigma'$. The first theorem states that it is sufficient to
know $\alpha^1_{t,q'',q'}$ and $\alpha^2_{t,q'',q'}$ for every
$q'' \in \mathcal{P}(q')$ to calculate both $\mathrm{app}(t+1, q', q, 1)$ and
$\mathrm{app}(t+1, q', q, 2)$. Additionally, we only need the three most likely
probabilities $y_{t+1,a}$ per arc and time step no matter how many labels allow
to move from $q'$ to $q$.

Theorem 6. Let $\Gamma(t, k, q'', q', q) := \{\gamma^j_{t,q',q} \mid j \in
\{1, 2, 3\}\} \setminus \{\zeta^k_{t-1,q'',q'}\}$ be the three most likely labels
without the previous ending label $\zeta^k_{t-1,q'',q'}$. Then for $i = 1, 2$
eq. (3) simplifies to

$$\mathrm{app}(t, q', q, 1) = \max\left\{ \alpha^k_{t-1,q'',q'}\, y_{t,a} \mid q'' \in \mathcal{P}(q'),\ k \in \{1, 2\},\ a \in \Gamma(t, k, q'', q', q) \right\} \tag{4}$$

$$\mathrm{app}(t, q', q, 2) = \max\left\{ \alpha^k_{t-1,q'',q'}\, y_{t,a} \mid q'' \in \mathcal{P}(q'),\ k \in \{1, 2\},\ a \in \Gamma(t, k, q'', q', q) \setminus \{\zeta^1_{t,q',q}\} \right\} \tag{5}$$

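The proof of Theorem 6 can be found in the Appendix. An equivalent
statement using the same values as in Theorem 6 for the calculation of the
likelihood of consecutive identical labels calculates
$\widetilde{\mathrm{cont}}(t, q', q, i)$ as

$$\widetilde{\mathrm{cont}}(t, q', q, 1) = \max\left\{ \alpha^k_{t-1,q',q}\, y_{t,\zeta^k_{t-1,q',q}} \mid k \in \{1, 2\} : \zeta^k_{t-1,q',q} \in \{\gamma^j_{t,q',q} \mid j = 1, 2, 3\} \right\}$$

$$\widetilde{\mathrm{cont}}(t, q', q, 2) = \max\left\{ \alpha^k_{t-1,q',q}\, y_{t,\zeta^k_{t-1,q',q}} \mid k \in \{1, 2\} : \zeta^k_{t-1,q',q} \in \{\gamma^j_{t,q',q} \mid j = 1, 2, 3\} \setminus \{\zeta^1_{t,q',q}\} \right\} \tag{6}$$

Although $\widetilde{\mathrm{cont}}(t, q', q, i) \neq \mathrm{cont}(t, q', q, i)$
in general, the conditions given in Theorem 7 are sufficient to ensure that this
approximation does not influence the final probability.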


Theorem 7. Let $\pi^*$ be the most likely feasible path with respect to the
regular expression $r$. Assume the following conditions:

1. $\forall t : (\pi^*_t, \dots, \pi^*_{t+n-1}) = \sigma^n \in \Sigma^n
   \Rightarrow n \leq 2$ (i.e. $\pi^*$ contains at most 2 consecutive identical
   labels from $\Sigma$)

2. $\forall t : |\{a \in \Sigma : y_{t,a} > y_{t,\star}\}| < 3$ (the NaC is one
   of the three most likely labels at each position)

Then, there is a $q \in F$ with $q' \in \mathcal{P}(q)$ such that
$\alpha^1_{T,q',q} = P(\pi^*|X)$ if $\alpha$ is calculated using (6) as
substitution for cont.

Again, the proof can be found in the Appendix.


Remark 8. Errors only appear for arcs reading more than 2 characters. We
call these arcs critical.

The conditions of Theorem 7 are not unlikely to be met in Recurrent Neural
Networks trained with CTC. The NaC is always very probable (condition 2) and
the likelihoods of other labels are often very spiky (condition 1), i.e. one
rarely observes more than two consecutive identical labels in the best path
except for the NaC. (In [2] they call this the dominance of blank predictions.)

Remark 9. Theorems 6 and 7 allow us to preselect the most likely channels per
arc and position. The calculations of any arc can be reduced by calculating the
probability of prefixes ending on the three most likely labels of the considered
arc.
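
The preselection of Remark 9 amounts to a top-3 selection per arc and position. A minimal Python sketch, assuming that $Y$ is given as a list of dictionaries (one per position, mapping each label to its probability) and that the labels of one arc are passed as a collection; the function name `top_labels` is hypothetical:

```python
import heapq

def top_labels(Y, arc_labels, k=3):
    """For each position, return the k most likely labels readable by one arc."""
    return [heapq.nlargest(k, arc_labels, key=row.__getitem__) for row in Y]

Y = [{"a": 0.10, "b": 0.40, "c": 0.20, "d": 0.25, "*": 0.05},
     {"a": 0.50, "b": 0.10, "c": 0.05, "d": 0.15, "*": 0.20}]
print(top_labels(Y, "abcd"))
```

After this step, the work per arc and position no longer depends on how many labels the arc can read.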


The calculation of $\alpha^i_{t,q',q}$ requires $\alpha^j_{t-1,q'',q'}$ for all
$q'' \in \mathcal{P}(q')$ and $\alpha^j_{t-1,q',q}$. Thus, there are two
possible chronological orders to calculate the $\alpha^i_{t,q',q}$:

1. Fix $t$ and calculate $\alpha^i_{t,q',q}$ starting at $q_0$ before moving on
   to $t + 1$.

2. Fix $(q', q)$ and calculate $\alpha^i_{t,q',q}$ for all $t$ before moving on
   to the successor states.

We suggest the second variant mainly for computational reasons. Finishing the
calculation of one state allows to keep the necessary values in the cache and
promises a fast calculation. However, we did not test the first variant. The
downside of the second variant is that we must not allow cycles of length
greater than one in the automaton $N$ (which result in cycles of length 2 in the
extended automaton). Otherwise, we would require information of subsequent (not
yet calculated) arcs. In general, this restriction forbids the use of the Kleene
star operator in a regular expression. To allow at least the Kleene star for
single characters or character groups, we calculate all transitions depicted in
Figure 2(b) at once whenever $q_1 = q_2$.

The Algorithms 4 and 5 show the pseudo code of the proposed algorithm.


Algorithm 4: RegExDecoder

input : Network output $Y$, regular expression $r$
output: Likelihood $p = P(\pi^*(r)|X)$

$N \leftarrow$ createNFA($r$);   // Thompson's Construction Algorithm
$\tilde N \leftarrow$ extendAutomaton($N$);   // Algorithm 1
for $q \in$ Successor($q_0$) do
    calculate $(\alpha_{t,q_0,q})_{t=1}^{T}$;   // Algorithm 5
$p \leftarrow 0$;
for $q \in F$ do
    foreach $q' \in \mathcal{P}(q)$ do $p \leftarrow \max\{p, \alpha^1_{T,q',q}\}$;


Capturing Groups 

As already mentioned, information about a part of the regular expression can be
crucial. In keyword spotting, for example, the likelihood of the keyword
determines whether or not the current spot is accepted. But the likelihoods
of the labels next to the keyword are also important to decide whether or not
the spotted word is only a part of a larger word. To connect parts of the
regular expression with parts of the automaton, we take advantage of the
notation of capturing groups:

A capturing group g of a regular expression r is a consecutive part within a
pair of parentheses. Thus, the group is related to certain arcs of the automaton.
Hence, g captures some part of the current output Y only if the most likely
path related to r makes use of an arc related to g. Then, the captured label
sequence is the part of the most likely path $\pi^*(r)$ read by the subautomaton
related to g. In a straightforward way, one calculates the probability or the
bounds (start and end position) of g according to $\pi^*(r)$.
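
The notion of a capturing group and its bounds can be illustrated on plain strings with Python's `re` module. This is only an analogy on already decoded text, not the decoder described here, which works on network outputs; the keyword `house` and the sample strings are hypothetical.

```python
import re

# The parentheses around "house" form the capturing group. The neighbouring
# non-capturing parts require that the keyword is not embedded in a larger word.
PATTERN = re.compile(r"(?:^|[^a-z])(house)(?:[^a-z]|$)")

def spot(text):
    """Return (start, end) of the captured keyword, or None if it is not spotted."""
    match = PATTERN.search(text)
    return match.span(1) if match else None

print(spot("the house is old"))  # bounds of the captured group
print(spot("the lighthouse"))    # keyword is only part of a larger word
```

The neighbouring characters decide whether the spot is accepted, which mirrors the role of the labels next to the keyword in the paragraph above.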

Vocabularies 

Typically, a decoding process includes one or more vocabularies. The regular
expression of such a vocabulary can be expressed as an alternation of words.
The optimal automaton accepting a collection of words is well known: the
deterministic acyclic finite state automaton (DAFSA). There are very efficient
algorithms for constructing a corresponding minimal DAFSA (see [3]). The
number of arcs decreases dramatically compared to alternating the vocabulary
words naively.
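
The reduction in arcs can be made tangible with a small sketch. It counts the arcs of a naive alternation (one chain per word), of a trie (shared prefixes), and of the minimal DAFSA obtained by merging equivalent trie states bottom-up. This is an illustrative construction under simple assumptions, not the incremental algorithm of [3]; the four-word vocabulary is hypothetical.

```python
def naive_arcs(words):
    """Naive alternation: every word gets its own chain of arcs."""
    return sum(len(word) for word in words)

def build_trie(words):
    """Prefix tree; each node is a dict with a final flag and outgoing arcs."""
    root = {"final": False, "next": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["next"].setdefault(ch, {"final": False, "next": {}})
        node["final"] = True
    return root

def count_trie_arcs(node):
    return sum(1 + count_trie_arcs(child) for child in node["next"].values())

def minimize(node, registry):
    """Assign one id per class of equivalent states (same finality, same arcs)."""
    signature = (
        node["final"],
        tuple(sorted((ch, minimize(child, registry))
                     for ch, child in node["next"].items())),
    )
    return registry.setdefault(signature, len(registry))

def count_dafsa_arcs(words):
    registry = {}
    minimize(build_trie(words), registry)
    # Each registered signature is one state of the minimal DAFSA.
    return sum(len(arcs) for _, arcs in registry)

words = ["walk", "walks", "talk", "talks"]
print(naive_arcs(words), count_trie_arcs(build_trie(words)), count_dafsa_arcs(words))
```

For this toy vocabulary the trie shares prefixes only, while the DAFSA additionally shares the common suffix "alk(s)".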

Footnote: Since the NaC is not part of the regular expression, one may decide whether or not the
likelihood calculation and the optimal path include the leading and trailing NaC labels.
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Algorithm 5: calculate 


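input : likelihoods $\alpha$, ending labels $C$, extended NFA $N$

if all $\alpha_{t,q'',q'}$ with $q'' \in P(q')$ are calculated then
    for i ← 1 to 2 do
        if q' == $q_0$ then
            $\alpha^i_{1,q',q}$ ← $y_{1,\gamma^i_{1,q',q}}$;
            $C^i_{1,q',q}$ ← $\gamma^i_{1,q',q}$;
        else
            $\alpha^i_{1,q',q}$ ← 0;
    for t ← 2 to T do
        app ← 0;
        cont ← 0;
        for k = 1 to 2 do
            for j = 1 to 3 do
                for $q'' \in P(q')$ do
                    if $C^k_{t-1,q'',q'} \neq \gamma^j_{t,q',q}$ then
                        app ← max{app, $\alpha^k_{t-1,q'',q'} \, y_{t,\gamma^j_{t,q',q}}$};
                    if $C^k_{t-1,q',q} = \gamma^j_{t,q',q}$ then
                        cont ← max{cont, $\alpha^k_{t-1,q',q} \, y_{t,\gamma^j_{t,q',q}}$};
        $\alpha^1_{t,q',q}$ ← max{app, cont};
        update($C^1_{t,q',q}$);    // set $C^1_{t,q',q}$ to the maximizing $\gamma^j_{t,q',q}$
        // calculate $\alpha^2_{t,q',q}$, $C^2_{t,q',q}$ analogously with the additional
        // constraint not to read $C^1_{t,q',q}$
    foreach $\tilde{q} \in$ Successor($q$) do calculate $(\alpha_{t,q,\tilde{q}})_{t=1}^{T}$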





Table 1: Statistics over the text recognition experiment: “size” denotes the
number of words of the vocabulary, “# total arcs” denotes the number of arcs in
the automaton and “# critical arcs” denotes the number of arcs which read more
than 2 labels, “greatest deviation - absolute” denotes the difference between the
exact negative logarithmic probability and the result of the RegEx-Decoder.
The “greatest deviation - relative” is the deviation divided by the exact absolute
logarithmic probability.

vocabulary        size    # total arcs   # critical arcs   greatest deviation
                                                           absolute    relative
HTRtS             9273    12398          12                9.95E-14    2.1E-12
general English   21698   25997          32                9.95E-14    2.1E-12

Nevertheless, the number of arcs increases strongly for large vocabularies 
such that a fast and effective decoding process is impossible. 

5 Experiments 

The aim of this section is to show that the decoding works properly and fast.
We show that Algorithms 4 and 5 work correctly in practical applications
and analyze situations in which they fail. We compare our approximation
with the exact most likely path. Further applications of the RegEx-Decoder can
be found in [TB] and m- 

We did all time statistics on a laptop with Intel i7-4940MX 3.10GHz CPU, 
32GB RAM and SSD. 

5.1 Text recognition 

First, we show that our approximation is reasonable for practical applications
such as the HTRtS competition from ICFHR 2014. The data consists
of 400 handwritten pages. We train on 350 pages and validate on 50 pages.
The validation set is also used to evaluate the decoding. Each page consists
of several lines of text including words, punctuation, numbers and symbols.
The neural network used in [TB] generates the output matrices. We compare
the most likely word of a vocabulary obtained by the RegEx-Decoder with
the result of the string-by-string decoding. For this purpose,
the RegEx-Decoder is used to split these matrices into regions of words and
regions containing spaces, numbers etc. The evaluation is done on the resulting
4657 submatrices representing the word regions. These matrices correspond to
outputs of subimages of single words. We use two vocabularies: one containing
9273 words (generated from the HTRtS data) and one containing 21698 words
(a modern, general vocabulary made from two million English sentences from
http://corpora.uni-leipzig.de/).
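
For reference, string-by-string decoding of the most likely path can be sketched as a Viterbi pass per vocabulary word over the word interleaved with NaCs. This is a minimal sketch of that standard CTC dynamic program, not the implementation used in the experiments; the function names, the channel index `BLANK = 0` and the toy matrix are assumptions.

```python
import math

BLANK = 0  # assumed index of the NaC channel
NEG_INF = float("-inf")

def _log(p):
    return math.log(p) if p > 0 else NEG_INF

def best_path_logprob(y, word):
    """Log-likelihood of the most likely path that collapses to `word`.

    y: list of positions, each a list of channel probabilities.
    word: list of channel indices (without NaC).
    """
    ext = [BLANK]                      # word interleaved with NaCs
    for label in word:
        ext += [label, BLANK]
    v = [NEG_INF] * len(ext)
    v[0] = _log(y[0][ext[0]])
    if len(ext) > 1:
        v[1] = _log(y[0][ext[1]])
    for frame in y[1:]:
        new = [NEG_INF] * len(ext)
        for s in range(len(ext)):
            best = v[s]                                  # stay on the same symbol
            if s >= 1:
                best = max(best, v[s - 1])               # advance by one symbol
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                best = max(best, v[s - 2])               # skip the NaC in between
            if best > NEG_INF:
                new[s] = best + _log(frame[ext[s]])
        v = new
    return max(v[-1], v[-2]) if len(ext) > 1 else v[-1]

def most_likely_word(y, vocabulary):
    """String-by-string decoding: score every word and keep the best one."""
    return max(((word, best_path_logprob(y, word)) for word in vocabulary),
               key=lambda item: item[1])

y = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.1, 0.4], [0.3, 0.1, 0.6]]
print(most_likely_word(y, [[1], [2], [1, 2], [2, 1]]))
```

The cost grows linearly with the vocabulary size, which is the motivation for sharing computations in a single automaton.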

Table 1 shows the deviation of the negative logarithmic likelihood of the
RegEx-Decoder from the exact decoding. Clearly, the deviation is negligible.

Footnote: The automaton is generated using the strategy of Section 4.2.

Figure 3: Artificial writings of two numbers for the number recognition task. 


Both vocabularies overlap, especially in the most
frequent words. Thus, it is not surprising that both vocabularies show the
same deviation, since both extrema (the greatest absolute and relative deviation)
appear for the same words (“General” and “of”). Since the number of critical arcs
is very small, we expected only a small deviation due to our approximation. In fact,
there is no additional confusion of words because of our approximation. Thus,
the experiment shows that the approximation can be applied in
practical applications with few critical arcs.

We evaluated the impact of the decoder empirically on the HTRtS15 test
set. We decreased the word error rate (WER) by almost 3 percentage points
compared to the best path decoding of the entire line (from 50.89% to 48.06%)
just by defining an appropriate regular expression for the expected line
structure without any vocabulary. Including a vocabulary, we further decreased
the WER to 33.90%.

5.2 Number recognition 

The next experiment involves artificially generated writings and investigates the
correctness of the algorithm in case of a relatively large number of critical arcs.
As remarked before, errors only appear for arcs reading more than two
labels. We enforce this condition by searching only for digits. Thus, every
arc not reading NaCs is critical since these arcs read more than two labels.
To provoke further continuation errors, we vary the number of digits actually
depicted in the image while the search pattern remains 3 to 5 digits (i.e. the
regular expression is [0-9]{3,5}). If the number of digits is greater than 5, the
decoder has to suppress emissions, which also promotes errors.

We vary the number of digits from 4 to 9. For each number of digits, we
generate 10,000 synthetic writings. The digits are written narrowly to provoke
further confusions (see Figure 3). The resulting images serve as input to four
neural networks with different number recognition expertise. We compare
the decoding results over the output matrices generated by these networks. The
RegEx-Decoder searches these output matrices for the most likely number
with 3 to 5 digits. The resulting number and probability are compared with the
most likely number resulting from a traditional string-by-string decoding
using a vocabulary of all numbers with 3 to 5 digits. Any difference
in the resulting optimal path (but not its probability) is regarded as an error.
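
For this particular search pattern, an exact reference result can also be obtained without enumerating the vocabulary. The sketch below is a plain dynamic program over the pair (number of labels emitted so far, channel read at the current position); it is neither the RegEx-Decoder nor the string-by-string decoder used in the experiments. It assumes that channel `BLANK = 0` is the NaC and that every other channel is a digit, and all names are illustrative.

```python
import math

BLANK = 0  # assumed index of the NaC channel; all other channels count as digits
NEG_INF = float("-inf")

def collapse(path):
    """CTC collapse: merge consecutive repeats, then drop the NaC."""
    out, prev = [], None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

def decode_label_count(y, lo=3, hi=5):
    """Most likely path whose collapse has between lo and hi labels (exact)."""
    n_channels = len(y[0])
    logy = [[math.log(p) if p > 0 else NEG_INF for p in frame] for frame in y]
    # score[k][l]: best log-likelihood of a prefix with k emitted labels
    # whose last position reads channel l.
    score = [[NEG_INF] * n_channels for _ in range(hi + 1)]
    score[0][BLANK] = logy[0][BLANK]
    if hi >= 1:
        for c in range(n_channels):
            if c != BLANK:
                score[1][c] = logy[0][c]
    back = []
    for t in range(1, len(y)):
        new = [[NEG_INF] * n_channels for _ in range(hi + 1)]
        ptr = [[None] * n_channels for _ in range(hi + 1)]
        for k in range(hi + 1):
            for l in range(n_channels):
                if score[k][l] == NEG_INF:
                    continue
                for c in range(n_channels):
                    # A NaC or a repeated channel emits nothing new.
                    k2 = k if c in (BLANK, l) else k + 1
                    if k2 > hi:
                        continue
                    cand = score[k][l] + logy[t][c]
                    if cand > new[k2][c]:
                        new[k2][c] = cand
                        ptr[k2][c] = (k, l)
        back.append(ptr)
        score = new
    best, k, l = max((score[k][l], k, l)
                     for k in range(lo, hi + 1) for l in range(n_channels))
    if best == NEG_INF:
        return NEG_INF, None          # no feasible path of this length
    path = [l]
    for ptr in reversed(back):
        k, l = ptr[k][l]
        path.append(l)
    path.reverse()
    return best, path
```

Calling `decode_label_count(y)` on an output matrix `y` (a list of per-position probability lists) returns the log-likelihood and the path; `collapse(path)` yields the recognized digit channels.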

Table 2 shows the errors per network and number of digits in the image. The more the
algorithm is forced to suppress digits the more errors occur. For 4 and 5 digits 
there is no force to suppress any written digit since the corresponding automaton 
is allowed to accept the ground truth. The errors are negligible in this case. 
However, although there are almost no errors in the resulting path, there are 

Footnote: Remember that errors only happen while calculating cont(t, q', q, i).

Table 2: Number recognition task: Number of differences in the most likely
paths of the RegEx-Decoder and the exact decoding for different neural nets
and different numbers of digits in the image but a constant regular expression of
[0-9]{3,5}.

        digits in the image
        4     5     6     7     8     9
net1    0     0     1     12    9     27
net2    0     0     24    27    40    40
net3    0     0     6     3     4     4
net4    0     1     4     4     7     7


small differences between the probability of the string-by-string decoding and
the RegEx-Decoder. For 6 to 9 digits, there are already significantly many
errors.

Even if there is a relatively high number of critical arcs, there will be only
little error if the regular expression fits the image content. If it does not
fit the number of digits in the image, there is a high risk of generating
additional confusion errors because of our approximation. However, even under
exact decoding the best feasible path then has a very low probability, which can
only be further underestimated by the approximation. Hence, the approximation
will likely not be harmful here since the decoding result can either be
rejected immediately or is unlikely to be of any significance in downstream
processing steps.

Figure 4 shows the required decoding time for the above network outputs and
regular expression. The RegEx-Decoder needs between 0.19 ms and 0.28 ms per
network output on average. The conventional string-by-string decoding needs
at least 4.68 ms per network output since it has to calculate the probabilities of
more or less all numbers with the specific number of digits under consideration.
To speed up the decoding, this method reuses already calculated
probabilities whenever the beginnings are the same. Additionally, it stops the
calculation of paths if the probability falls below the best match found so far.
Even with this speed-up mechanism the RegEx-Decoder is more than 22 times faster.
The running time of the A*-search grows exponentially as expected, but
its results match those of the string-by-string decoding perfectly.

The beam search with beam width 100 needs almost seven times more time
for the calculation than the RegEx-Decoder. A point of criticism might be that
we use no independent implementation to compare the time complexity and that we
may not have implemented the beam search algorithm optimally. Figure 5 shows the
corresponding extended automaton. Let us count the multiplications: Beam
search with beam width 100 calculates for each of the 100 prefixes at each time
step 11 new prefixes (one for each digit plus one adding the NaC) and thus 1100
multiplications per time step in total in the worst case. The RegEx-Decoder
calculates 6 new prefixes for each of the 10 arcs which read digits, and for each of
the 6 transitions requiring a NaC there is only one multiplication. Thus, we have
66 multiplications in total. Therefore, our theoretical analysis rather indicates
that the RegEx-Decoder is implemented suboptimally since beam search needs
Footnote: I.e. 12345 and 12346 share all probabilities for the prefix 1234.

Figure 4: Decoding times for the number recognition task (10,000 output
matrices) averaged over all four networks.



Figure 5: Extended automaton accepting 3 to 5 digits and optional intermediate 
NaCs. 


16 times more multiplications. Although beam search with beam width 100 is
much slower, it yields significantly more errors (roughly 40 errors on average
if the ground truth consists of 4 or 5 digits). To get a comparable performance for the
experiments with 4 and 5 digits, we need a beam width of at least 1000.
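
The multiplication count per time step stated in the text can be written out as follows; the numbers are exactly those given above, and only the last line (the ratio) is computed here.

```latex
\begin{align*}
\text{beam search (width 100):}\quad & 100 \cdot 11 = 1100 \\
\text{RegEx-Decoder:}\quad & 10 \cdot 6 + 6 \cdot 1 = 66 \\
\text{ratio:}\quad & 1100 / 66 \approx 16.7
\end{align*}
```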

6 Conclusion 

In this article, we consider regular expressions for the decoding of neural network
outputs. Regular expressions are a very efficient way to define a pattern of
interest to search for in text strings. We suggest using such patterns for a convenient
and clear decoding process. Similar results may also be achieved by a smart
evaluation of the best path. The advantage of regular expressions over individual
evaluation of the output is the simple and unified notation. Furthermore, the
proposed algorithm allows a highly adaptable decoding process since only the
regular expression has to be changed.

Footnote: The running time increases from roughly 15 sec to 115 sec for all 10,000 output
matrices.

We show how to exploit finite automata to find the most likely feasible label
sequence of a regular language. A further analysis of the decoding procedure
yields a speed-up of the algorithm such that it also works fast for complex
regular expressions or many network outputs. We also propose an approximation
which is shown to be exact under conditions which are commonly satisfied for
CTC-trained networks. This theoretical result was confirmed by experiments.
As a main result, we showed that the decoder is applicable in practical
scenarios. Even if the approximation fails to produce exact results, it is likely that
the ground truth does not fit the regular expression. This results in a low
probability decoding result further underestimated by our approximation, which
should not be harmful in most applications. Additionally, these experiments
show that the proposed method is very efficient compared to state-of-the-art
decoding algorithms.

The proposed speed-ups work only for the path probability $P(\pi|X)$ (instead
of the word probability). If the decoder should return the exact word probability,
all paths contribute to the result and, thus, cannot be skipped. Hence, speed-ups
seem to be hard. Additionally, we have to take care of distinct paths
through the automaton accepting the same label sequence. An unambiguous
FSA or even a DFA is required to ensure that the automaton accepts every
path (of labels) only once. We already discussed the disadvantages of DFAs in
an earlier section.
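
The difference between the path probability and the word probability can be checked on a tiny example by brute force. The sketch below enumerates all paths, which is only feasible for toy sizes; the channel index `BLANK = 0` for the NaC, the function names and the two-position output are assumptions for illustration.

```python
from itertools import groupby, product

BLANK = 0  # assumed index of the NaC channel

def collapse(path):
    """CTC collapse: merge consecutive repeats, then drop the NaC."""
    return [label for label, _ in groupby(path) if label != BLANK]

def path_and_word_probability(y, word):
    """Return (likelihood of the best path, summed likelihood of all paths)
    among the paths that collapse to `word`."""
    best, total = 0.0, 0.0
    for path in product(range(len(y[0])), repeat=len(y)):
        if collapse(path) != word:
            continue
        p = 1.0
        for t, label in enumerate(path):
            p *= y[t][label]
        best = max(best, p)
        total += p                    # every path contributes to the word probability
    return best, total

y = [[0.6, 0.4], [0.6, 0.4]]          # two positions, channels: NaC and label 1
print(path_and_word_probability(y, [1]))
print(path_and_word_probability(y, []))
```

Here the single most likely path reads only NaCs, although the word consisting of label 1 has the larger summed probability, so the two criteria can disagree.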

There are plenty of applications for the proposed algorithm. The method 
can be applied e.g. to keyword spotting but also patterns of image retrieval 
tasks can be described conveniently. The proposed decoder is an essential part 
of our handwriting recognition systems e.g. for HTRtS (full text recognition) 
and ANWRESH (form reading) competitions. 
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Appendix 

Proof of Theorem. For $t = 1$, the claim is correct since the most likely path
of length 1 consists of the most likely character if $q' = q_0$. Otherwise we obtain
zero.

Let $t > 1$. To keep things simple, we fix $q''$ to consider only prefixes through
$(q'',q')$ and $(q',q)$. Therefore, let $\alpha^1_{t+1,q'',q',q}$ be the likelihood of the most likely
prefix through $q''$, $q'$ and $q$ and let $C^1_{t+1,q'',q',q}$ be the label read at $t + 1$. Then

$$\mathrm{app}(t + 1, q', q, 1) = \max_{q'' \in P(q')} \alpha^1_{t+1,q'',q',q}.$$

Analogously, let $\alpha^2_{t+1,q'',q',q}$ be the likelihood of the most likely feasible sequence
through $q''$, $q'$ and $q$ not ending on $C^1_{t+1,q'',q',q}$. Then

$$\mathrm{app}(t + 1, q', q, 2) = \max_{q'' \in P(q')} \alpha^2_{t+1,q'',q',q}.$$

The theorem is proven if

$$\alpha^1_{t+1,q'',q',q} = \max\{\alpha^i_{t,q'',q'}\, y_{t+1,a} \mid i \in \{1,2\},\ a \in \Gamma(t + 1, i, q'', q', q)\},$$

$$\alpha^2_{t+1,q'',q',q} = \max\{\alpha^i_{t,q'',q'}\, y_{t+1,a} \mid i \in \{1,2\},\ a \in \Gamma(t + 1, i, q'', q', q) \setminus \{C^1_{t+1,q'',q',q}\}\}.$$

Figure 6: Most likely suffix through $q''$, $q'$ and $q$ and possible combinations to
calculate $\alpha^i_{t+1,q'',q',q}$.

We make a case distinction, calculate the exact probability and show that
$\alpha^i_{t+1,q'',q',q}$ only depends on $\alpha^j_{t,q'',q'}$ for $j \in \{1, 2\}$ and on $\gamma^k_{t+1,q',q}$.

For the sake of simplicity, we omit the index $q', q$ for $\gamma$ for the rest of the proof.
Analogously, we omit $q'',q'$ for $\alpha^i_t$. Thus, $\alpha^i_t = \alpha^i_{t,q'',q'}$ and $\gamma^i_{t+1} = \gamma^i_{t+1,q',q}$.

We check the following cases:

1: $C^1_{t,q'',q'} \neq \gamma^1_{t+1}$, i.e. there are no restrictions. Hence, the most likely
path combines the most likely path through $q''$, $q'$ with the most likely
label at arc $(q',q)$:

$$\alpha^1_{t+1,q'',q',q} = \alpha^1_t\, y_{t+1,\gamma^1_{t+1}}.$$


a: $C^1_{t,q'',q'} \neq \gamma^2_{t+1}$ (see Figure 7(a)). There are no restrictions such that
the second most likely path is

$$\alpha^2_{t+1,q'',q',q} = \alpha^1_t\, y_{t+1,\gamma^2_{t+1}}.$$

b: $C^1_{t,q'',q'} = \gamma^2_{t+1}$ (dotted combination in Figure 7(b)). Suffixes ending on
$\gamma^1_{t+1}$ and the suffix $(C^1_{t,q'',q'}, \gamma^2_{t+1})$ are not allowed in $\Pi(t + 1, q', q)$. Thus,

$$\alpha^2_{t+1,q'',q',q} = \max\{\alpha^1_t\, y_{t+1,\gamma^3_{t+1}},\ \alpha^2_t\, y_{t+1,\gamma^2_{t+1}}\}.$$

Figure 7: Subcases of case 1, (a) case 1a and (b) case 1b. The combination yielding
$\alpha^1_{t+1,q'',q',q}$ is dashed, other forbidden paths are dotted, solid arcs denote
possible combinations to calculate $\alpha^2_{t+1,q'',q',q}$.

Figure 8: Subcases of case 2, (a) case 2a and (b) case 2b. The combination yielding
$\alpha^1_{t+1,q'',q',q}$ is dashed, other forbidden paths are dotted, solid arcs denote
possible combinations to calculate $\alpha_{t+1,q'',q',q}$.


2: $C^1_{t,q'',q'} = \gamma^1_{t+1}$. Thus, it is not allowed that consecutive arcs
read the same label at consecutive positions. The most likely path from
$\Pi(t + 1, q', q)$ through $q''$, $q'$ and $q$ combines either the most likely path
from $\Pi(t, q'', q')$ with the second most likely label at position $t + 1$ or
the second most likely path from $\Pi(t, q'', q')$ with the most likely label at
position $t + 1$:

$$\alpha^1_{t+1,q'',q',q} = \max\{\alpha^1_t\, y_{t+1,\gamma^2_{t+1}},\ \alpha^2_t\, y_{t+1,\gamma^1_{t+1}}\}.$$


a: Ci 


t+l,q",q',q 


= 74^+1 (dashed combination in Figure 8(a) I. The only 
restriction is not to read such that the second most likely path 
is simply 


^t+l,q",q',q — 


b: Ct+i,q",q',q — Tt+i (dashed combination in Figure 8(b) I. Hence, the 
suffices {Ct,q",q': It+i) are forbidden. 


a: 


t+l,q”,q',q = 


.7?+i} 


This completes the proof. 

□

Proof of Theorem. Note that the approximation is exact for arcs reading two
or fewer labels since in this case the approximated and the exact
cont(t, q', q, i) maximize over the same paths. In particular, NaC-transitions
are always exact since they read only one label.

Let IT* be the most likely feasible path with respect to the regular expression 
r and let 


t 

cont(t,g',g,i) = yt' 

t '^1 


Assume TrJ" G E. Due to Assumption 


for i G {1,2}, i.e. 

The likelihood of (tt}, ..., is equal to al_ 
greater than 2 since otherwise ^ {jt-i 
the substitution of TTt_i by * would yield a feasible path with greater likelihood 
due to condition This contradicts to the assumption that it* is maximizing 


t-i q' q loi' some j. j cannot be 
,q',q^l?-i,q\q} i^ee Figure |7(a)|) and 


the likelihood of all feasible paths. 
i < 2. 


Thus, we only need to compute a 


t,q’,q 


for 


If $\pi^*_t \notin \{\gamma^1_{t,q',q}, \gamma^2_{t,q',q}, \gamma^3_{t,q',q}\}$, we get a feasible, more likely path by
substituting $\pi^*_t$ by *. This new path collapses to the same word. Again, this
contradicts the maximum likelihood of $\pi^*$. Thus, we only need to consider
the three most likely labels per arc.


□ 
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